A Study of the Data Provided by the Donors Choose Organization

by Steve Henle
Hazel John
Jim Schlough

Overview and Motivation

Funding the contemporary K-12 classroom is a great challenge, and many teachers believe the resources provided are insufficient to meet even the most basic objectives. Nontraditional support plays an increasingly important part in sustaining the modern classroom.

The Donors Choose web platform provides a mechanism for benefactors to support teachers. A potential donor may read an appeal written by a teacher to fulfill a specific classroom material need and donate toward its funding goal. If the funding goal is met, the materials are sourced by fulfillment vendors managed by the Donors Choose organization and sent directly to the school.

Given the changes in public sentiment and commitment toward financial support of community schools over the past few decades, the ability to raise funds directly for the classroom may come to be considered a vital skill for teachers in supporting classroom activities.

When written appeals succeed by becoming fully funded, or fail by expiring, the data surrounding the appeal is gathered and made publicly available. By analyzing this data, it might be possible to better understand the factors correlated with success or failure of a written appeal. Some factors such as location, date and time, and poverty level are beyond the control of a teacher. Other factors, such as the written content of an appeal, or to a lesser extent, the credentials of the teacher, can be controlled.

We wish to apply statistical analysis to the available data in an effort to go beyond axiomatic and aesthetic beliefs about what makes a funding proposal more or less effective.


Initial Questions

We set forth to answer the following questions regarding the funding proposals for Donors Choose.

  • What makes the difference between a proposal that is funded and one that expires?
  • What are the winning qualities?
  • Of the predictive qualities, are there any under the control of the writer?
  • Does the content of the written essay matter?


Data

We are using the data made publicly available on the Donors Choose website. It comes in the form of downloadable csv files, ranging in size from megabytes to gigabytes, and can be found here.

The proposals are one element of the open data set published on the Donors Choose web site. A diagram of the schema, showing the suggested recomposition, can be found at Donors Choose.

Data Wrangling

Step 1: Data Download & Cleanup

Our first task was to download the data, clean it up, and extract what we needed. We decided to do the analysis with just the data from 2014, so the final task was to filter out unneeded data.


rm_old: Our First Data Wrangling Utility

Our first plan was to use a combination of AWK and SED for the necessary data cleanup, but the time and date fields proved problematic. After an evening of steady effort along those lines, we wrote a C++ data-cleaning application as a stopgap measure. It provided a temporary means of separating the 2014 project records from the csv file and getting the rest of the team started with the data. This intermediate C++ application was named rm_old.


Data Wrangling in R

During analysis of the data files retrieved from the Donors Choose website, we noticed that they contained string fields with special characters that caused errors in the standard read utilities for comma-delimited files. The first plan was to read the files line by line and use “gsub” to remove unnecessary characters, but this proved very slow. We then tried to run the UNIX utility “sed” within the “pipe” function, but special characters like ctrl-M made that problematic. The final iteration still used “sed”, but run as a system call from R. Each file required slight variations in the search patterns, so a different _clnup.sed script was created for each.
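The shape of that final approach can be sketched as follows. This is an illustrative reconstruction, not our actual script: for portability the sketch uses the tr utility in place of the per-file _clnup.sed scripts, and the sample record is made up.

```r
# Sketch of cleanup via a shell utility invoked with system() from R.
# (Illustrative only: the real pipeline ran per-file *_clnup.sed scripts.)
raw_file   <- tempfile(fileext = ".csv")
clean_file <- tempfile(fileext = ".csv")

# A made-up record containing a stray carriage return (ctrl-M).
writeLines("id,essay\r", raw_file)

# Strip the carriage returns with a system call rather than pipe().
system(paste("tr -d '\\r' <", raw_file, ">", clean_file))

readLines(clean_file)
```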

Our final data wrangling consisted of the following steps:

  • Data download from the website
  • Cleanup via “sed” with a system call
  • Extraction of CSV as a dataframe
  • Filtering out projects not posted in 2014
  • Storing cleaned dataframes as RDS files
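The steps above can be sketched end to end as one function. This is a hedged reconstruction: the sed-script interface and the 2014 filter on date_posted are as described, but the function itself and its argument names are ours, not the code we actually ran.

```r
library(dplyr)

# Illustrative wrangler for one downloaded raw file.
wrangle_file <- function(raw_csv, sed_script, rds_out) {
  clean_csv <- tempfile(fileext = ".csv")
  # Cleanup via "sed" with a system call
  system(paste("sed -f", sed_script, raw_csv, ">", clean_csv))
  # Extract the CSV as a dataframe, keep 2014 postings, store as RDS
  read.csv(clean_csv, stringsAsFactors = FALSE) %>%
    filter(substr(date_posted, 1, 4) == "2014") %>%
    saveRDS(rds_out)
}
```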


Step 2: Data Load from Disk

A simple function was written to load the cleaned data from the RDS files stored on disk. Once the data has been written to disk locally, this function is all that is needed to load the data objects. All data exploration and analysis was run on the loaded data.
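A minimal version of that loader might look like this; the directory layout and file naming are assumptions, not our exact code:

```r
# Hypothetical loader: reads a cleaned dataframe back from its RDS file.
load_wrangled <- function(name, dir = ".") {
  readRDS(file.path(dir, paste0(name, ".rds")))
}

# e.g. projects <- load_wrangled("projects_2014")
```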



Exploratory Analysis

We began exploring the data to see what relationships might be discovered within it, comparing completed and expired projects.

#add column to project df called percent_funded that tells the percent money given over asked for
projects <- projects %>% mutate(percent_funded = (total_donations / total_price_including_optional_support))

#test for normality of percent funded
hist(projects$percent_funded)

qqnorm(projects$percent_funded)

The data is nowhere near normal, so a logistic analysis of funded vs. non-funded makes much more sense.

We will now look at the different factors contained in the projects file to determine whether they affect the likelihood of getting funded. There are technically three outcomes for each request: completed, meaning the funding goal was reached or exceeded; expired, meaning time ran out without reaching the goal; and reallocated, meaning the goal was not reached but donors chose to give the previously pledged amounts to a different proposal.
#poverty level graph
poverty_level_graph<-projects %>% group_by(poverty_level, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(poverty_level) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(poverty_level, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8)) + ylab("Final Status")
poverty_level_graph

#latitude and longitude level graph by amount of goal
latitude_graph<-projects %>% ggplot(aes(x = school_latitude, y = percent_funded)) + geom_point() + ylab("Final Status")
longitude_graph<-projects %>% ggplot(aes(x = school_longitude, y = percent_funded)) + geom_point() + ylab("Final Status")
latitude_graph

longitude_graph

#state graph
state_graph<-projects %>% group_by(school_state, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_state) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_state, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8)) + ylab("Final Status")
state_graph

#school urban vs rural
urban_vs_rural <-projects %>% group_by(school_metro, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_metro) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_metro, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8)) + ylab("Final Status")
urban_vs_rural

#teacher prefix, sort of a stand in for teacher gender
projects %>% group_by(teacher_prefix, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(teacher_prefix) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(teacher_prefix, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#charter school
projects %>% group_by(school_charter, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_charter) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_charter, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#magnet school
projects %>% group_by(school_magnet, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_magnet) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_magnet, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#school year round
projects %>% group_by(school_year_round, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_year_round) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_year_round, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#new Leaders New Schools affiliated 
projects %>% group_by(school_nlns, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_nlns) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_nlns, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#knowledge is power public charter school (KIPP)
projects %>% group_by(school_kipp, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_kipp) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_kipp, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#school charter ready promise
projects %>% group_by(school_charter_ready_promise, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(school_charter_ready_promise) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(school_charter_ready_promise, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#teach for america teacher
projects %>% group_by(teacher_teach_for_america, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(teacher_teach_for_america) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(teacher_teach_for_america, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#ny teaching fellow
projects %>% group_by(teacher_ny_teaching_fellow, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(teacher_ny_teaching_fellow) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(teacher_ny_teaching_fellow, percent)) + geom_point(aes(color = funding_status)) + ylab("Final Status")

#primary focus subject
projects %>% group_by(primary_focus_subject, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(primary_focus_subject) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(primary_focus_subject, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8)) + ylab("Final Status")

#primary focus area
projects %>% group_by(primary_focus_area, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(primary_focus_area) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(primary_focus_area, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8)) + ylab("Final Status")

#resource type
projects %>% group_by(resource_type, funding_status) %>% summarise(total= n()) %>% 
  group_by(resource_type) %>% mutate(totals = sum(total), percent = total/totals) %>% ungroup %>% 
  ggplot(aes(resource_type, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8)) + ylab("Final Status")

#grade level
projects %>% group_by(grade_level, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(grade_level) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(grade_level, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8))  + ylab("Final Status")

#total price excluding optional support
projects %>% ggplot(aes(x = total_price_excluding_optional_support, y = percent_funded)) + geom_smooth()  + ylab("Final Status")

#Might be a good idea to make a histogram of total price, then regraph with just the lower price

#students reached
projects %>% ggplot(aes(x = students_reached, y = percent_funded)) + geom_smooth() + ylab("Final Status")

#eligible for double your impact match
projects %>% group_by(eligible_double_your_impact_match, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(eligible_double_your_impact_match) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(eligible_double_your_impact_match, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8)) + ylab("Final Status")

#eligible for almost home match
projects %>% group_by(eligible_almost_home_match, funding_status) %>% 
  summarise(total= n()) %>% 
  group_by(eligible_almost_home_match) %>% 
  mutate(totals = sum(total), percent = total/totals) %>% 
  ungroup %>% 
  ggplot(aes(eligible_almost_home_match, percent)) + geom_point(aes(color = funding_status)) + theme(
    axis.text.x=element_text(angle=90, size=8))  + ylab("Final Status")

#date posted
projects %>% ggplot(aes(x = date_posted, y = percent_funded)) + geom_smooth() + ylab("Final Status")

Building a Prediction Model for Proposal Success

Our first attempt at building a model used the glm method in the train function, selecting several variables that looked to have an effect in the earlier exploratory analysis: amount of money asked for, school state, primary focus area, primary focus subject, resource type, and date posted. We used glm because our outcome is either funded or not funded, and glm works well for logistic regression.
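The call behind the resampling summary below looked roughly like this sketch. It is a hedged reconstruction, not the exact code we ran: the outcome is coded 1 for funded and 0 otherwise, and the formula and column names should be treated as assumptions.

```r
library(caret)

# Hedged reconstruction of the first model fit.
fit_baseline <- function(projects) {
  # Code the outcome numerically: 1 = funded, 0 = not funded.
  projects$funded <- as.numeric(projects$funding_status == "completed")
  train(funded ~ total_price_excluding_optional_support + school_state +
          primary_focus_area + primary_focus_subject + resource_type +
          date_posted,
        data = projects, method = "glm")
}
```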

## Generalized Linear Model 
## 
## 170326 samples
##     45 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 170320, 170320, 170320, 170320, 170320, 170320, ... 
## Resampling results:
## 
##   RMSE       Rsquared  
##   0.4396291  0.09170007
## 
## 

The RMSE from this method is not great, but it is a start, so we tried to improve on it. First we filtered out proposals with some of the uncommon attributes that appeared to have an effect in the exploratory analysis: for example, whether the teacher was in Teach for America or was a New York Teaching Fellow, or whether the school was a charter or another special school type. In total these filtered out only a few percent of the applications, and they are factors the requesters cannot change, so they are not useful in building a prediction model. We then ran the training again.

## Generalized Linear Model 
## 
## 133860 samples
##     45 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 133855, 133855, 133855, 133855, 133855, 133855, ... 
## Resampling results:
## 
##   RMSE       Rsquared 
##   0.4427254  0.0922961
## 
## 

Removing these proposals had little effect on the RMSE (it actually rose slightly, from 0.4396 to 0.4427). The next thought was that perhaps our model was trying to fit too many parameters, so we pared it down to include only the factors the exploratory analysis showed to have the greatest effect: cost, date of posting, and resource type.

## Generalized Linear Model 
## 
## 133860 samples
##     45 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 133860, 133860, 133860, 133860, 133860, 133860, ... 
## Resampling results:
## 
##   RMSE       Rsquared  
##   0.4469941  0.07528859
## 
## 

This again changed the RMSE very little. Looking back at the data, it was clear that the cost requested has the strongest effect, so we ran train once more to see whether that variable alone might match our accuracy (as measured by RMSE).

## Generalized Linear Model 
## 
## 133860 samples
##     45 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 133860, 133860, 133860, 133860, 133860, 133860, ... 
## Resampling results:
## 
##   RMSE       Rsquared 
##   0.4478779  0.0709457
## 
## 

So, using all of the data provided by the requesters except the contents of the essay, it appears that the best model we can build uses the total cost as a predictor.


Text Analysis

Word Selection Analysis

As part of the data available from the DonorsChoose website, we were able to download the full text of the teacher-written requests accompanying all classroom projects. The heart of each classroom project request is the teacher’s written essay, and we wanted to use text mining to see if we could extract information from the essays to predict which projects get funded. The two questions we chose to answer were:

  • Do the sentiments expressed in an essay help predict whether its project gets completed or not?
  • Are there specific words that appear in the essays of fully funded projects but not in the incompletely funded ones?

Sentiment Analysis

We started by tokenizing the essay words and then using the “nrc” sentiment lexicon to assign sentiments to the words. We then computed the percentage occurrence of each sentiment in essays grouped by the funding status of the projects and compared the differences between “completed” and “expired” projects.
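The tally feeding essays_sentiment_freq can be sketched as follows. To keep the sketch self-contained, the lexicon is passed in as a plain data frame; in practice the “nrc” lexicon would come from tidytext's get_sentiments("nrc"). The column names (essay, funding_status) are assumptions.

```r
library(dplyr)
library(tidytext)

# Hedged sketch of the per-sentiment percentage tally.
sentiment_freq <- function(essays, lexicon) {
  essays %>%
    unnest_tokens(word, essay) %>%                # one row per essay word
    left_join(lexicon, by = "word") %>%           # attach sentiment labels
    group_by(funding_status, sentiment) %>%
    summarise(n = n(), .groups = "drop_last") %>%
    mutate(occurance_pct = 100 * n / sum(n)) %>%  # % within each status
    ungroup()
}
```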
# Plot the sentiment frequency and separate by funding status
essays_sentiment_freq %>%
  filter(funding_status != "reallocated") %>%
  ggplot(aes(x=sentiment, y = occurance_pct, fill = sentiment)) +
  geom_bar(stat="identity") +
  facet_grid(~funding_status) +
  theme(text = element_text(size = 10),
        title = element_text(size = 12),
        legend.key.size = unit(0.5, "cm"),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle('Sentiment Occurance% in Funding Essays') +
  coord_flip()
## Warning: Removed 2 rows containing missing values (position_stack).

We can see from the graph that both positive and negative sentiments occur at almost identical rates in completed and expired projects. We can see the minor differences more clearly in the data frame.

## Source: local data frame [11 x 3]
## 
##       sentiment  completed    expired
##           (chr)      (dbl)      (dbl)
## 1         anger  1.1026375  1.0838772
## 2  anticipation  5.6380648  5.5864252
## 3       disgust  0.4495098  0.4131162
## 4          fear  1.2389231  1.2130671
## 5           joy  5.9211053  5.7905396
## 6      negative  2.3453001  2.2955258
## 7      positive 17.0967005 17.3146193
## 8       sadness  1.4367613  1.3943689
## 9      surprise  2.0834330  2.0215808
## 10        trust  8.3226758  8.4123286
## 11           NA 54.3648887 54.4745512

Word Analysis

We then analyzed the occurrence of specific words in the essays for insights. We started by looking at the top 10 words for both fully funded and expired projects.
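A sketch of that top-10 tally, assuming a tokenized essay_words table with funding_status and word columns (stop words already removed); the function and its names are illustrative:

```r
library(dplyr)

# Hedged sketch: the n most frequent words for one funding status.
top_words <- function(essay_words, status, n = 10) {
  essay_words %>%
    filter(funding_status == status) %>%
    count(word, sort = TRUE) %>%   # word frequencies, descending
    slice_head(n = n)
}

# usage: top_words(essay_words, "completed"); top_words(essay_words, "expired")
```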

grid.arrange(p1, p2, nrow=1)

The top 10 words for both categories were almost identical. The only thing that stood out was that “excited” was in the top ten for completed projects versus “technology” for expired projects. This could be related to the fact that technology is more expensive, making such projects less likely to be fully funded.

We removed the top-10 words common to both project categories to focus on the rest of the words.

We then used wordclouds to compare the words that occur the most in both funding categories. We started with high frequency words in essays for projects that were fully funded.

Fig 1: Wordcloud for “completed” project essays

Then we compared it with the high frequency words in essays for projects that expired and weren’t fully funded.

Fig 2: Wordcloud for “expired” project essays

Looking at the wordclouds, one thing that stood out was the higher relative prominence of “books” for funded projects versus “technology” for expired projects.


Essay Length Analysis: Comparing the word count for completed vs expired projects

We took a look to see whether an essay’s length, in terms of word count, matters.

kable(word_count_summaries, digits = 0, caption = "Essay Word Counts")
Essay Word Counts

  project_count_2014                    170326
  completed_project_count_2014          118039
  completed_project_mean_word_count        302
  completed_project_sd_word                 84
  expired_project_count_2014             51246
  expd_total_word_sums_count             51245
  expd_total_word_sums_mean                305
  expd_total_word_sums_sd                   86


Essay word count comparison results

Does essay word count matter?

Here we look at the number of words in each essay, to see whether there is any significant difference between the word counts of completed and expired essays.
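A sketch of this comparison, assuming an essays table with funding_status and essay columns; the whitespace-based word split is a simplification of whatever tokenization was actually used:

```r
# Hedged sketch: per-essay word counts compared across funding status.
compare_word_counts <- function(essays) {
  # Count words by splitting on runs of whitespace.
  essays$word_count <- lengths(strsplit(essays$essay, "\\s+"))
  # Two-sample t-test on completed vs. expired essays.
  t.test(word_count ~ funding_status,
         data = subset(essays, funding_status != "reallocated"))
}
```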

## [1] "On average, completed essays had essay word counts that were 2.9918 shorter than expired ones"

The length of essay, in terms of word count, does not seem to matter much all by itself.


Final Analysis

After examining the data, we were able to draw some conclusions. Of all the predictors, the total project cost was found to be the most significant.